Optimized hardware architecture and method for ECC point addition using mixed affine-Jacobian coordinates over short Weierstrass curves

ABSTRACT

An optimized hardware architecture and method introducing a simple arithmetic processor that allows efficient implementation of an Elliptic Curve Cryptography point addition algorithm for mixed Affine-Jacobian coordinates. The optimized architecture additionally reduces the required storage for intermediate values.

BACKGROUND

Electronic devices are becoming a ubiquitous part of everyday life. The number of smartphones and personal tablet computers in use is rapidly growing. A side effect of the increasing use of smartphones and personal tablets is that the devices are increasingly used for storing confidential data such as personal and banking data. Protection of this data against theft is of paramount importance.

The field of cryptography offers protection tools for keeping this confidential data safe. Based on hard-to-solve mathematical problems, cryptography typically requires highly computationally intensive calculations that are the main barrier to wider application in cloud and ubiquitous computing (ubicomp). If cryptographic operations cannot be performed quickly enough, cryptography tools are typically not accepted for use on the Internet. In order to be transparent while still providing security and data integrity, cryptographic tools need to follow trends driven by the need for high speed and the low power consumption needed in mobile applications.

Public key algorithms are typically the most computationally intensive calculations in cryptography. For example, take the case of Elliptic Curve Cryptography (ECC), one of the most computationally efficient public key algorithms. The 256 bit version of ECC provides security that is equivalent to a 128 bit symmetric key. A 256 bit ECC public key should provide comparable security to a 3072 bit RSA public key. The fundamental operation of ECC is a point multiplication, which is an operation heavily based on modular multiplication; for example, approximately 3500 modular multiplications of 256 bit integers are needed for performing one ECC 256 point multiplication. Higher security levels (larger bit integers) require even more computational effort.

Building an efficient implementation of ECC is typically non-trivial and involves multiple stages. FIG. 1 illustrates stages 101, 102 and 103 that are needed to realize the Elliptic Curve Digital Signature Algorithm (ECDSA), which is one of the applications of ECC. Stage 101 deals with finite field arithmetic that comprises modular addition, inversion and multiplication. Stage 102 deals with point addition and point doubling, which comprises the Joint Sparse Form (JSF), Non-Adjacent Form (NAF), windowing and projective coordinates. Finally, stage 103 deals with the ECDSA and the acceptance or rejection of the digital signature.

Any elliptic curve can be written as a plane geometric curve defined by an equation of the form (assuming the characteristic of the coefficient field is not equal to 2 or 3):

y² = x³ + ax + b  (1)

that is non-singular; that is, it has no cusps or self-intersections. This is known as the short Weierstrass form, where a and b are integers. The case where a=−3 is typically used in several standards such as those published by NIST, SEC and ANSI, which makes this the case of typical interest.
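By way of a non-limiting illustration, equation (1) and the non-singularity condition can be checked numerically. The sketch below assumes the curve is taken modulo a prime p and uses the standard discriminant test 4a³ + 27b² ≠ 0 (mod p). The toy parameters p=23 and b=1 and the function names are chosen here purely for illustration and do not come from any standard.

```python
# Toy parameters for illustration only (assumed, not taken from the text):
# a tiny prime field with a = -3 as in the case of typical interest.
P_MOD = 23
A = -3 % P_MOD
B = 1


def is_nonsingular(a, b, p):
    """Return True when 4a^3 + 27b^2 != 0 (mod p), i.e. the curve
    has no cusps or self-intersections."""
    return (4 * pow(a, 3, p) + 27 * pow(b, 2, p)) % p != 0


def on_curve(x, y, a, b, p):
    """Check equation (1), y^2 = x^3 + ax + b, modulo p."""
    return (y * y - (x * x * x + a * x + b)) % p == 0


if __name__ == "__main__":
    print(is_nonsingular(A, B, P_MOD))
    print(on_curve(2, 7, A, B, P_MOD))
```

On this toy curve the points (0, 1), (2, 7) and (7, 1) satisfy equation (1), while (1, 1) does not.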

Many algorithms have been proposed in the literature for efficient implementation of the Point Addition (PADD) and Point Doubling (PDBL) operations. Many of these algorithms are optimized for software implementation. While these are typically efficient on certain platforms, the algorithms are typically not optimal once the underlying hardware can be tailored to the algorithm.

A PADD algorithm for mixed affine-Jacobian coordinates has been described by Cohen, Miyaji and Ono in Proceedings of the International Conference on the Theory and Applications of Cryptography and Information Security; Advances in Cryptology, ASIACRYPT 1998, pages 51-65, Springer-Verlag, 1998. Jacobian coordinates are projective coordinates where each point is represented as three coordinates (X, Y, Z) where x=X/Z² and y=Y/Z³, while affine coordinates are the familiar (x, y) coordinates. Note the coordinates are all integers. PADD algorithm 200 requires 8 modular multiplications, 3 modular squarings, 6 modular subtractions, and one modular multiplication by 2 and is shown in FIG. 2. In order to perform the PADD, the algorithm further requires a minimum of 4 temporary registers, which for ECC 256 bit each need to be 256 bits in size. All operations are done in the finite field K over which the elliptic curve E is defined. The finite arithmetic field K is defined over the prime number p so that all arithmetic operations are performed modulo p. The additive identity element is the point at infinity.
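The relationship x=X/Z², y=Y/Z³ between the two coordinate systems can be sketched as follows. This is a minimal illustration, not part of the described architecture. The function names are hypothetical, the use of Z=0 to encode the point at infinity is an assumed convention, and the modular inverse relies on the three-argument `pow` with a negative exponent (Python 3.8 or later).

```python
def to_jacobian(x, y, z, p):
    """Lift affine (x, y) to Jacobian (X, Y, Z) for a chosen non-zero Z,
    so that x = X/Z^2 and y = Y/Z^3 (mod p)."""
    return (x * z * z % p, y * z * z * z % p, z % p)


def to_affine(X, Y, Z, p):
    """Map Jacobian (X, Y, Z) back to affine (x, y).
    Returns None for Z == 0 (assumed encoding of the point at infinity)."""
    if Z % p == 0:
        return None
    z_inv = pow(Z, -1, p)          # the single modular inversion
    z_inv2 = z_inv * z_inv % p
    return (X * z_inv2 % p, Y * z_inv2 * z_inv % p)


if __name__ == "__main__":
    p = 23                          # toy prime, for illustration only
    P = to_jacobian(2, 7, 3, p)
    print(P, to_affine(*P, p))
```

The conversion back to affine coordinates is the only place an inversion appears, which is why it is typically deferred until the end of a point multiplication.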

SUMMARY

An optimized hardware architecture and method reduces storage requirements and speeds up the execution of the ECC PADD algorithm by requiring only two temporary storage registers and by introducing a simple arithmetic unit for performing modular subtraction and modular multiplication by 2.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows stages 101, 102 and 103 that are needed to realize the Elliptic Curve Digital Signature Algorithm (ECDSA).

FIG. 2 shows a prior art point addition algorithm.

FIG. 3 shows an embodiment in accordance with the invention.

FIG. 4 shows an embodiment in accordance with the invention.

FIG. 5 shows an embodiment in accordance with the invention.

FIG. 6 shows an embodiment in accordance with the invention.

FIG. 7 shows an embodiment in accordance with the invention.

DETAILED DESCRIPTION

PADD algorithm 300 in accordance with the invention is shown in FIG. 3. PADD algorithm 300 requires fewer steps and reduces the storage requirements compared to PADD algorithm 200 for the same modular addition of two points. PADD algorithm 300 requires only two temporary storage registers, T₁ and T₂. Note, PADD algorithm 300 performs modular point addition using mixed affine-Jacobian coordinates to avoid the need for a modular inversion operation that is typically one to two orders of magnitude slower than a modular multiplication operation. The use of mixed coordinates provides a speed advantage over performing the point addition solely in Jacobian coordinates, which also obviates the need for a modular inversion operation. PADD algorithm 300 is implemented over an optimized hardware architecture shown in FIG. 6 and FIG. 7 and specifically designed to take advantage of PADD algorithm 300.

As input in step 301, PADD algorithm 300 shown in FIG. 3 takes point P=(X₁, Y₁, Z₁) in Jacobian coordinates and point Q=(x₂, y₂) in affine coordinates as the two points to be added together as P+Q. T₁ and T₂ are temporary storage variables. Note that all mathematical operations shown are in modular arithmetic. In step 302 of PADD algorithm 300, the value of point P is returned as the result of the modular addition of P+Q if Q=∞, as a point at infinity is the additive identity element. Similarly, in step 303, the value of point Q is returned as the result of the modular addition of P+Q if P=∞, as a point at infinity is the additive identity element. In step 304, the Jacobian coordinate Z₁ is squared and the resulting value stored in temporary register T₁. In step 305, Z₁*T₁ is calculated and the resulting value stored in temporary register T₂. In step 306, T₂*y₂−Y₁ is calculated, where y₂ is in affine coordinates and Y₁ is in Jacobian coordinates, the result being stored in temporary register T₂. In step 307, the value stored in temporary register T₁ is multiplied by x₂ and X₁ is then subtracted from the result, where x₂ is in affine coordinates and X₁ is in Jacobian coordinates, the result being stored in temporary register T₁. Step 308 provides for a return if T₁ and T₂ are both zero as this means P=Q and step 309 provides for a return if T₁ is zero and T₂ is not zero as this means P=−Q. In step 310, the Jacobian coordinate Z₁ is multiplied by the value in temporary register T₁ and the result is stored as Jacobian coordinate Z₃. In step 311, the value stored in temporary register T₁ is squared and stored as Jacobian coordinate Y₃. In step 312, the value stored in temporary register T₂ is squared and stored as Jacobian coordinate X₃. In step 313, Y₃*T₁ is calculated and the result is stored in temporary register T₁. In step 314, T₁+2Y₃*X₁ is calculated and subtracted from Jacobian coordinate X₃ with the result stored as Jacobian coordinate X₃. In step 315, Y₃*X₁−X₃ is calculated and multiplied by T₂ and stored as Jacobian coordinate Y₃. Note that Y₃*X₁ was calculated in step 314 and that value is used in step 315 and is not calculated again in step 315. In step 316, T₁*Y₁ is calculated and subtracted from Jacobian coordinate Y₃ and the result is stored as Jacobian coordinate Y₃. Finally, in step 317 the result of the point addition of P+Q: (X₃, Y₃, Z₃) is returned in Jacobian coordinates.
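For illustration, steps 302 through 317 can be modeled in software as sketched below. This is a reference model under stated assumptions, not the hardware implementation. The function name `padd_mixed` is hypothetical; the point at infinity is encoded as `None`; step 303 returns Q lifted to Jacobian form with Z=1; step 308 raises an error because the text does not state what is returned when P=Q; and step 309 returns the point at infinity. The variable `m` stands for the product Y₃*X₁ that step 315 reuses from step 314 and is not a third temporary register.

```python
def padd_mixed(P, Q, p):
    """Mixed affine-Jacobian point addition following steps 302-317.
    P = (X1, Y1, Z1) in Jacobian coordinates or None (assumed infinity).
    Q = (x2, y2) in affine coordinates or None (assumed infinity)."""
    if Q is None:                      # step 302: P + infinity = P
        return P
    if P is None:                      # step 303: infinity + Q = Q
        return (Q[0], Q[1], 1)         # assumed lift to Jacobian, Z = 1
    X1, Y1, Z1 = P
    x2, y2 = Q
    T1 = Z1 * Z1 % p                   # step 304
    T2 = Z1 * T1 % p                   # step 305
    T2 = (T2 * y2 - Y1) % p            # step 306
    T1 = (T1 * x2 - X1) % p            # step 307
    if T1 == 0 and T2 == 0:            # step 308: P == Q
        raise ValueError("P == Q: point doubling required (assumed handling)")
    if T1 == 0:                        # step 309: P == -Q
        return None                    # assumed: result is infinity
    Z3 = Z1 * T1 % p                   # step 310
    Y3 = T1 * T1 % p                   # step 311
    X3 = T2 * T2 % p                   # step 312
    T1 = Y3 * T1 % p                   # step 313
    m = Y3 * X1 % p                    # product computed once in step 314
    X3 = (X3 - T1 - 2 * m) % p         # step 314
    Y3 = (m - X3) * T2 % p             # step 315 reuses m
    Y3 = (Y3 - T1 * Y1) % p            # step 316
    return (X3, Y3, Z3)                # step 317


if __name__ == "__main__":
    # Toy curve y^2 = x^3 - 3x + 1 over p = 23 (illustrative values only).
    # P is the affine point (2, 7) written with Z = 3; Q is (0, 1).
    print(padd_mixed((18, 5, 3), (0, 1), 23))
```

On the toy curve, the returned Jacobian triple corresponds to the affine point (7, 1), which matches the chord rule applied directly to (2, 7) and (0, 1).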

The most computationally intensive operation in PADD algorithm 300 in FIG. 3 is modular multiplication denoted by “*”. Because most of the steps described in PADD algorithm 300 depend on the previous steps of the algorithm, it is typically most efficient to implement PADD algorithm 300 in hardware using a single modular multiplier, although more than one modular multiplier may be used in accordance with the invention. Using only one modular multiplier restricts each step in PADD algorithm 300 to having no more than one modular multiplication. While step 315 appears to contain two modular multiplications, the result of Y₃*X₁ has already been calculated in step 314 and is fed directly into the input of the hardware modular multiplier.

It is important to note that besides the modular multiplications performed in steps 306, 307, 314, 315 and 316 of PADD algorithm 300, two additional, comparatively simple operations are performed as well: modular subtraction and modular multiplication by 2. Note that multiplication or division by a power of 2 in binary is merely a shift operation. In order to speed up execution of PADD algorithm 300 and eliminate the need for additional temporary registers, an embodiment in accordance with the invention provides simple arithmetic unit (SAU) 400 with the inputs and outputs as shown in FIG. 4.

FIG. 5 shows how steps 306, 307, 314, 315 and 316 of PADD algorithm 300 in FIG. 3 are broken down for utilization of SAU 400 which has inputs A, B and C with output D. Note that the input and output labels of SAU 400 correspond to the respective variable names in FIG. 5. Block 501 shows how step 306 of PADD algorithm 300 is broken down using SAU 400 and involves setting inputs A=T₂*y₂ and B=Y₁ with output D=A−B. Output D is written to temporary register T₂. Block 502 shows how step 307 of PADD algorithm 300 is broken down using SAU 400 and involves setting inputs A=T₁*x₂ and B=X₁ with output D=A−B. Output D is then written to temporary register T₁. Block 503 shows how step 314 of PADD algorithm 300 is broken down using SAU 400 and involves setting inputs A=X₃, B=T₁, C=Y₃*X₁ with output D=A−B−2C. Output D is written to Jacobian coordinate X₃. Block 504 shows how step 315 of PADD algorithm 300 is broken down using SAU 400 and involves setting inputs A=Y₃*X₁ and B=X₃ with output D=A−B. Output D is written to Jacobian coordinate Y₃. Block 505 shows how step 316 of PADD algorithm 300 is broken down using SAU 400 and involves setting inputs A=Y₃ and B=T₁*Y₁ with output D=A−B. Output D is written to Jacobian coordinate Y₃. Note that “don't care” indicates the value is irrelevant to the calculation being performed in the respective steps.

FIG. 6 shows embodiment 600 in accordance with the invention comprising multi-cycle multiplier 610 with output register (not shown), SAU 400, multiplexer (MUX) 620 and MUX 630 with input registers X₁, Y₁, Z₁, x₂, y₂, output registers X₃, Y₃, Z₃ and temporary registers T₁ and T₂ that are all part of register memory 695. Note the individual register labels correspond to variable names in FIGS. 3 and 5. MUX 620, MUX 630 and MUX 740 (the latter being part of SAU 400, see FIG. 7) are controlled by the microprocessor (not shown) which executes PADD algorithm 300. As noted above, each step in PADD algorithm 300 involves at most one modular multiplication (not counting multiplication or division by 2, which in binary representation is merely a shift operation).

SAU 400 shown in FIG. 7 comprises subtractors 710 and 720, logical one bit left shifter 715 and MUX 740. Input A connects to the minuend input of subtractor 710 on line 670 and input B connects to the subtrahend input of subtractor 710 on line 675. Input C connects to logical one bit left shifter 715 on line 650 where logical one bit left shifter 715 performs a multiplication of the input C by two. Subtractor 710 outputs A−B on line 730 which connects to the minuend input of subtractor 720 and the “0” input for MUX 740. Logical one bit left shifter 715 outputs 2C on line 735 to the subtrahend input of subtractor 720. Subtractor 720 outputs A−B−2C on line 750 to the “1” input for MUX 740. MUX 740 sends D on line 690.
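The behavior of SAU 400 can be summarized with a small functional model. The sketch below is illustrative only: the function name `sau` and the parameter `sel` (standing for the select input of MUX 740) are hypothetical, and the reduction modulo p is folded into each stage because the text does not describe how the subtractors and shifter perform the reduction.

```python
def sau(a, b, c, sel, p):
    """Functional model of SAU 400: returns D = A - B when sel == 0
    and D = A - B - 2C when sel == 1, all modulo p."""
    diff = (a - b) % p              # subtractor 710: A - B
    doubled = (c << 1) % p          # one bit left shifter 715: 2C
    full = (diff - doubled) % p     # subtractor 720: A - B - 2C
    return full if sel else diff    # MUX 740 selects the output D


if __name__ == "__main__":
    p = 23                          # toy prime, for illustration only
    print(sau(4, 5, 0, 0, p))       # two-input mode, C is "don't care"
    print(sau(1, 10, 13, 1, p))     # three-input mode
```

In the two-input mode the value of C does not affect the result, which corresponds to the “don't care” entries of FIG. 5.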

Multi-cycle multiplier 610 functions by multiplying the values on inputs 635 and 640 together and outputting the result. Steps 301-303 are performed in the microprocessor (not shown) without using multi-cycle multiplier 610 and SAU 400.

Step 304 utilizes multi-cycle multiplier 610. Register memory 695 provides Z₁ on both inputs 635 and 640 of multi-cycle multiplier 610 and multi-cycle multiplier 610 computes Z₁² which is sent on line 650 to register memory 695 and stored in temporary register T₁.

Step 305 utilizes multi-cycle multiplier 610. Register memory 695 provides T₁ on input 635 and Z₁ on input 640 of multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₁*Z₁ which is sent on line 650 to register memory 695 where it is stored in temporary register T₂.

Step 306 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides T₂ and y₂ on lines 635 and 640, respectively, to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₂*y₂ which is output on line 650 to input “1” of MUX 620 with MUX 620 set to “1”. MUX 630 input is set to “0”. MUX 620 sends T₂*y₂ to input A of SAU 400 on line 670. Line 670 is directly connected to the minuend input of subtractor 710. Register memory 695 provides Y₁ on line 660 to input “0” of MUX 630 and MUX 630 is set to “0”. MUX 630 sends Y₁ to input B of SAU 400 on line 675. Line 675 is directly connected to the subtrahend input of subtractor 710. Subtractor 710 computes A−B (which is T₂*y₂−Y₁) and outputs A−B on line 730 to input “0” of MUX 740 with MUX 740 set to “0”. MUX 740 sends D (which is A−B) on line 690 to register memory 695 where it is stored in temporary register T₂.

Step 307 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides T₁ and x₂ on lines 635 and 640, respectively, to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₁*x₂ which is output on line 650 to input “1” of MUX 620 with MUX 620 set to “1”. MUX 620 sends T₁*x₂ to input A of SAU 400 on line 670. Line 670 is directly connected to the minuend input of subtractor 710. Register memory 695 provides X₁ on line 660 to input “0” of MUX 630 and MUX 630 is set to “0”. MUX 630 sends X₁ to input B of SAU 400 on line 675. Line 675 is directly connected to the subtrahend input of subtractor 710. Subtractor 710 computes A−B (which is T₁*x₂−X₁) and outputs A−B on line 730 to input “0” of MUX 740 with MUX 740 set to “0”. MUX 740 sends D (which is A−B) on line 690 to register memory 695 where it is stored in temporary register T₁.

Steps 308-309 are performed in the microprocessor (not shown) without using multi-cycle multiplier 610 and SAU 400.

Step 310 utilizes multi-cycle multiplier 610. Register memory 695 provides T₁ on line 635 and Z₁ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₁*Z₁ and the result is output on line 650 to register memory 695 where it is stored in Z₃, consistent with step 310 of PADD algorithm 300.

Step 311 utilizes multi-cycle multiplier 610. Register memory 695 provides T₁ on both lines 635 and 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₁² and the result is output on line 650 to register memory 695 where it is stored in Y₃.

Step 312 utilizes multi-cycle multiplier 610. Register memory 695 provides T₂ on both lines 635 and 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₂² and the result is output on line 650 to register memory 695 where it is stored in X₃.

Step 313 utilizes multi-cycle multiplier 610. Register memory 695 provides T₁ on line 635 and Y₃ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes T₁*Y₃ and the result is output on line 650 to register memory 695 where it is stored in temporary register T₁.

Step 314 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides X₃ on line 665 to input “0” of MUX 620 with MUX 620 set to “0”. MUX 620 sends X₃ to input A of SAU 400 on line 670. Line 670 is directly connected to the minuend input of subtractor 710. Register memory 695 provides T₁ on line 660 to input “0” of MUX 630 with MUX 630 set to “0”. MUX 630 sends T₁ on line 675 to input B of SAU 400. Line 675 is directly connected to the subtrahend input of subtractor 710. Subtractor 710 computes and outputs A−B (which is X₃−T₁) on line 730 to the minuend input of subtractor 720. Register memory 695 provides X₁ on line 635 and Y₃ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes Y₃*X₁. The result is output on line 650 to input C of SAU 400 which is directly connected to logical one bit left shifter 715 which multiplies input C by two and outputs 2C (which is 2Y₃*X₁) on line 735 to the subtrahend input of subtractor 720. Subtractor 720 computes and outputs A−B−2C on line 750 to input “1” of MUX 740 with MUX 740 set to “1”. MUX 740 sends D (which is A−B−2C=X₃−T₁−2Y₃*X₁) on line 690 to register memory 695 where it is stored in X₃.

Step 315 utilizes both multi-cycle multiplier 610 and SAU 400. In step 314, Y₃*X₁ was computed by multi-cycle multiplier 610. Hence, Y₃*X₁ is still present in the output register (not shown) of multi-cycle multiplier 610 and in step 315 is sent on line 650 to input “1” of MUX 620 and MUX 620 is set to “1”. MUX 620 sends Y₃*X₁ on line 670 to input A of SAU 400. Line 670 is connected directly to the minuend input of subtractor 710. Register memory 695 provides X₃ on line 660 to input “0” of MUX 630 with MUX 630 set to “0”. MUX 630 sends X₃ on line 675 to input B of SAU 400. Line 675 is directly connected to the subtrahend input of subtractor 710. Subtractor 710 calculates A−B and sends the result on line 730 to input “0” of MUX 740 with MUX 740 set to “0”. MUX 740 sends D (which is A−B=Y₃*X₁−X₃) on line 690 to register memory 695 which passes D through on line 635 and provides T₂ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes D*T₂ (which is (Y₃*X₁−X₃)*T₂) and outputs the result on line 650 to register memory 695 where the result is stored in Y₃.

Step 316 utilizes both multi-cycle multiplier 610 and SAU 400. Register memory 695 provides Y₃ on line 665 to input “0” of MUX 620 with MUX 620 set to “0”. MUX 620 sends Y₃ on line 670 to input A of SAU 400. Line 670 is directly connected to the minuend input of subtractor 710. Register memory 695 provides T₁ on line 635 and Y₁ on line 640 to multi-cycle multiplier 610. Multi-cycle multiplier 610 computes and outputs T₁*Y₁ on line 650 to input “1” of MUX 630 with MUX 630 set to “1”. MUX 630 sends T₁*Y₁ on line 675 to input B of SAU 400. Line 675 is directly connected to the subtrahend input of subtractor 710. Subtractor 710 computes A−B (which is Y₃−T₁*Y₁) and provides the result on line 730 to input “0” of MUX 740 with MUX 740 set to “0”. MUX 740 sends D (which is Y₃−T₁*Y₁) on line 690 to register memory 695 where the result is stored in Y₃.

Step 317 returns the result of the addition of P+Q in Jacobian coordinates which is (X₃, Y₃, Z₃).
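As a hedged illustration of the step-by-step mapping described above, the following sketch models embodiment 600 at a purely functional level. The class name `Datapath`, its methods and the counter `mul_count` are hypothetical and introduced only for this example; cycle timing, the multiplexer select signals and the register memory ports are not modeled; steps 301-303 and 308-309 are omitted, so the inputs are assumed to satisfy P≠±Q with neither point at infinity. The attribute `mul_out` stands for the output register of multi-cycle multiplier 610, which lets step 315 reuse Y₃*X₁ without a second multiplication.

```python
class Datapath:
    """Functional stand-in for multiplier 610 plus SAU 400 (no timing)."""

    def __init__(self, p):
        self.p = p
        self.mul_out = 0      # models the multiplier output register
        self.mul_count = 0    # counts multiplier invocations

    def mul(self, a, b):
        self.mul_out = a * b % self.p
        self.mul_count += 1
        return self.mul_out

    def sau(self, a, b, c=0, sel=0):
        d = (a - b) % self.p                          # A - B
        return (d - 2 * c) % self.p if sel else d     # A - B - 2C when sel=1


def run_padd(dp, X1, Y1, Z1, x2, y2):
    """Steps 304-307 and 310-317, one multiplier use per step."""
    T1 = dp.mul(Z1, Z1)                          # step 304
    T2 = dp.mul(T1, Z1)                          # step 305
    T2 = dp.sau(dp.mul(T2, y2), Y1)              # step 306
    T1 = dp.sau(dp.mul(T1, x2), X1)              # step 307
    Z3 = dp.mul(T1, Z1)                          # step 310
    Y3 = dp.mul(T1, T1)                          # step 311
    X3 = dp.mul(T2, T2)                          # step 312
    T1 = dp.mul(T1, Y3)                          # step 313
    X3 = dp.sau(X3, T1, dp.mul(X1, Y3), sel=1)   # step 314
    Y3 = dp.mul(dp.sau(dp.mul_out, X3), T2)      # step 315 reuses mul_out
    Y3 = dp.sau(Y3, dp.mul(T1, Y1))              # step 316
    return X3, Y3, Z3                            # step 317


if __name__ == "__main__":
    dp = Datapath(23)         # toy prime, for illustration only
    print(run_padd(dp, 18, 5, 3, 0, 1), dp.mul_count)
```

In this model the multiplier is invoked once in each of steps 304-307 and 310-316, and only T₁ and T₂ are used as temporary storage in addition to the input and output registers.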

The invention claimed is:
 1. A data cryptographic apparatus comprising: computational logic configured to perform an elliptic curve cryptography (ECC) point addition operation using mixed affine-Jacobian coordinates over a short Weierstrass curve of the form y²=x³+ax+b where a=−3; a register memory configured to store a first point in affine coordinates and a second point in Jacobian coordinates, wherein the register memory is configured for two temporary storage variables, T₁ and T₂; a modular multiplier electrically coupled to the register memory, wherein the modular multiplier is configured to perform at most one modular multiplication for each step in a sequence of steps in the ECC point addition operation; and a simple arithmetic processor configured to perform modular subtraction and modular multiplication by two in support of the ECC point addition operation utilizing two modular subtractors, a logical one bit left shifter to either output A−B−2C for an input of variables A, B, and C or A−B for an input of variables A and B, wherein the simple arithmetic processor is electrically coupled to the computational logic, the register memory, and the modular multiplier to output a result of the ECC point addition operation in the Jacobian coordinates.
 2. A mobile device comprising the data cryptographic apparatus of claim 1.
 3. A smartcard comprising the data cryptographic apparatus of claim 1.
 4. The mobile device of claim 2, wherein the mobile device is a smartphone.
 5. A method for performing an elliptic curve cryptography (ECC) point addition operation using mixed affine-Jacobian coordinates over a short Weierstrass curve of the form y²=x³+ax+b where a=−3 comprising: accepting, with a computational device, as variable input a first point in affine coordinates and a second point in Jacobian coordinates using a simple arithmetic processor; configuring the simple arithmetic processor for modular subtraction and modular multiplication by two utilizing two modular subtractors, a logical one bit left shifter to either output A−B−2C for an input of variables A, B, and C or A−B for an input of variables A and B; enabling a modular multiplier of the computational device to execute a sequence of steps to perform the ECC point addition operation of the first point and the second point, wherein the modular multiplier performs at most one modular multiplication for each step in the sequence of steps, wherein the sequence of steps requires no more than two temporary variables; and outputting, by the computational device, a result of the ECC point addition operation in the Jacobian coordinates.
 6. The method of claim 5, wherein the computational device is part of a mobile device.
 7. The method of claim 6, wherein the mobile device is a smartphone.
 8. The method of claim 5, wherein the computational device is part of a smartcard.